The Cumulative Distribution Function as the Unsung Hero of Causal Inference
Partial Identification
Causal Inference
Author
John Graves
Published
October 5, 2023
Part one in a six-part series on partial identification methods for causal inference and policy decision-making.
Setup R session
library(tidyverse)
library(here)
library(glue)
library(copula)
library(knitr)
library(kableExtra)
library(ggsci)
library(MASS)
library(psych)
library(patchwork)
library(directlabels)
library(progress)
library(latex2exp)
select <- dplyr::select
options("scipen" = 100, "digits" = 6)
theme_set(hrbrthemes::theme_ipsum())
library(tufte)
# invalidate cache when the tufte version changes
knitr::opts_chunk$set(cache.extra = packageVersion('tufte'))
options(htmltools.dir.version = FALSE)
# knitr::purl(input = "./partial-id-intro.qmd",
#             output = "./partial-id-intro.r")
Introduction
Suppose a policymaker is considering the choice to implement and/or extend a policy that aims to improve health outcome \(Y\) (e.g., survival). The policymaker comes to you and asks for a full-scale evaluation to inform this decision. You have at your disposal observational data before and after a similar policy change—or, if resources and time are in your favor, you may elect to prospectively design an evaluation of the policy change.
Take a moment to think through the methods you might draw on, and the quantities of interest you might estimate with those methods. In all likelihood you are considering estimating an average treatment effect on the treated, or perhaps a complier average treatment effect—the workhorse estimands across the vast program and policy evaluation literature.
To estimate these quantities of interest you might draw on a difference-in-differences design, or perhaps an instrumental variables analysis. Alternatively, if you are able to design a prospective study, you might implement a randomized evaluation of the policy change.
Now step back for a moment and consider what might guide the policymaker’s decision independent of what you can actually estimate in your evaluation. What criteria might she be interested in to inform her choice?
1. The policy intervention improves outcomes, on average, among those treated.
2. The policy intervention incurs little-to-no harm among those treated.
3. The policy has a treatment effect of at least some threshold amount for the majority of those treated.
Each of these seems like a reasonable and justifiable criterion for making policy decisions.
Now map each of these decision criteria to the parameters of interest you can estimate. Assuming your study is adequately powered, the only quantity you can directly estimate is #1. That is, you will likely be able to inform decisions based on the average treatment effect—but you will very likely have little to nothing to say about whether anyone was harmed, or more generally what the distribution of the treatment effect looks like in the affected population.
The goal of this blog series is to extend the policy evaluation toolkit to provide practitioners with the tools to estimate #2 and #3. In addition, we will provide tools to produce bounds on estimates of #1 under minimal assumptions—bounds which can then be narrowed (at the limit, into a single point estimate) by layering in additional, stronger assumptions.
Searching for the keys under the difference-in-differences lamppost.
How does this differ from usual practice in the social sciences? Econometrics, and the social sciences in general, tend to start from the perspective of point identification: the idea that we should go as far out on the identification limb as needed to obtain a single, scalar point estimate of our quantity of interest.
But now think about the assumptions often needed to achieve point identification—particularly in observational research settings. These assumptions are often strong, perhaps even heroic. This dilutes the credibility of policy evaluations, opening up legitimate questions over whether decisions based on the evaluations are grounded in realistic assessments of policy impact.
But what if you could produce actionable information with fewer assumptions? For example, you might be able to rule out harmful treatment effects without having to make heroic assumptions. Or if the actionable treatment effect threshold is \(\lambda\) (i.e., the policy is worth implementing if the average treatment effect is at least \(\lambda\)), what if you could rule out that the true average treatment effect is less than \(\lambda\) while making fewer difficult- or impossible-to-validate identifying assumptions?
The Cumulative Distribution Function as the Unsung Hero of Causal Inference
The “Perfect” Doctor Data
Our running example will draw from an adapted and extended version of the “perfect doctor” data often used to teach causal inference methods (Rubin 2004; Imbens and Rubin 2015).
In the classic “Perfect Doctor” example, we are given potential outcomes for ten sample units. Critically, we observe potential outcomes under treatment and control. The outcome is envisioned as survival, so an increase in the outcome is interpreted as improving health, while a decrease is interpreted as harm.
In the original example, treatment is endogenously determined by a “perfect” doctor who always selects the treatment that benefits the patient. That is, if a given patient would survive 5 years with the treatment, but 7 years without it, the doctor would choose not to treat the patient. By comparison, if the patient would survive 3 years with the treatment, but 1 year without, the doctor would “assign” the patient to the treated group.
This example will differ from the original in the following way: we will (for now) assume that treatment assignment is exogenous—that is, it is randomly assigned. We’ll bring the endogenous treatment assignment mechanism back in later posts that cover partial identification approaches in observational data settings. So in that sense, the data we’ll use are “perfect” only in the sense that we observe both potential outcomes for each unit—not in the sense that the doctor matches each patient to the best treatment for them.
Let’s now look at 10 rows of the example data:
Generate the perfect doctor data
df <-
  # basic perfect doctor data (from Rubin)
  data.frame(
    patient = LETTERS[1:10],
    Y_1 = c(7, 5, 5, 7, 4, 10, 1, 5, 3, 9),
    Y_0 = c(5, 3, 8, 8, 3, 1, 6, 8, 8, 4)
  ) %>%
  mutate(Y_0 = c(1, 6, 1, 8, 2, 1, 10, 6, 7, 8)) %>%
  mutate(delta = Y_1 - Y_0) %>%
  mutate(D = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)) %>%
  mutate(Y_obs = case_when(D == 1 ~ Y_1, TRUE ~ Y_0)) %>%
  mutate(sim = 0) %>%
  mutate(X = rnorm(10, mean = 0, sd = 1))

# Simulate additional data from a copula that defines the correlation b/t potential
# outcomes, and between a single covariate X and the potential outcomes.
m <- 3           # number of columns
n <- 10000 - 10  # sample size
corXY <- 0.8     # correlation between potential outcomes and X
sigma <- matrix(c(1, 0.6, corXY,
                  0.6, 1, corXY,
                  corXY, corXY, 1), nrow = 3)  # correlation matrix

# Sample the joint CDF quantiles from a multivariate normal centered at 0
set.seed(100)
z <- mvrnorm(n, mu = rep(0, m), Sigma = sigma, empirical = TRUE)
u <- pnorm(z)  # get the quantiles associated with each value

# Generate the variables
delta <-
  # Treatment effect varies, but has mean 2 and SD 1.
  rnorm(n, mean = 2, sd = 1)
sY_1 <-
  # simulated potential outcome under Tx
  qpois(u[, 1], lambda = mean(df[df$D == 1, ]$Y_1)) + delta
sY_0 <-
  # simulated potential outcome under non-Tx
  qpois(u[, 2], lambda = mean(df[df$D == 1, ]$Y_1)) + rnorm(n = n)
X <-
  # a single covariate that is predictive of the outcome.
  qnorm(u[, 3], mean = 0, sd = 1)

df_sim <-
  # Construct the final data.
  tibble(patient = glue("sim{1:n}")) %>%
  mutate(Y_1 = sY_1, Y_0 = sY_0, X = X) %>%
  mutate(delta = Y_1 - Y_0) %>%
  mutate(D = as.integer(runif(nrow(.)) < .5)) %>%
  mutate(U = rnorm(nrow(.))) %>%
  # mutate(D = as.integer(Y_1 >= Y_0)) %>%
  mutate(Y_obs = case_when(D == 1 ~ Y_1, TRUE ~ Y_0)) %>%
  mutate(sim = 1)

df <-
  # Bind it with the general example rows (10 rows)
  df %>% bind_rows(df_sim)
Table 1: Example Data

| patient | Y_0 | Y_1 | D |
|---------|-----|-----|---|
| A | 1 | 7 | 1 |
| B | 6 | 5 | 1 |
| C | 1 | 5 | 1 |
| D | 8 | 7 | 1 |
| E | 2 | 4 | 1 |
| F | 1 | 10 | 0 |
| G | 10 | 1 | 0 |
| H | 6 | 5 | 0 |
| I | 7 | 3 | 0 |
| J | 8 | 9 | 0 |
In the data shown in Table 1, the first five rows are treated while the last five are untreated.
Unit-Level Treatment Effects
Because (for now) we observe both potential outcomes, we have all the information we need to define unit-level treatment effects. Let’s define delta as the unit-level difference in potential outcomes:
Table 2: Example Data With Unit-Level Treatment Effects

| patient | Y_0 | Y_1 | D | delta |
|---------|-----|-----|---|-------|
| A | 1 | 7 | 1 | 6 |
| B | 6 | 5 | 1 | -1 |
| C | 1 | 5 | 1 | 4 |
| D | 8 | 7 | 1 | -1 |
| E | 2 | 4 | 1 | 2 |
| F | 1 | 10 | 0 | 9 |
| G | 10 | 1 | 0 | -9 |
| H | 6 | 5 | 0 | -1 |
| I | 7 | 3 | 0 | -4 |
| J | 8 | 9 | 0 | 1 |
Because we can calculate delta for each unit, we can do all sorts of things. For example, we can look at the distribution of delta in our full sample (of 10,000 units) to get a sense of the distribution of the treatment effect. Figure 1 below plots this distribution.
Figure 1: Distribution of the Treatment Effect (delta)
We see from Figure 1 that the treatment effect is positive, on average, but is negative for some individuals. This is relevant, actionable information not captured by the average treatment effect alone!
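A figure like Figure 1 can be drawn directly from the unit-level effects. The sketch below is a minimal, self-contained version: it uses a simulated stand-in for `df$delta` (in practice the `delta` column built in the setup code would be used) and adds a dashed reference line at zero to separate harmed from helped units.

```r
# Minimal sketch of a Figure-1-style plot. `delta` is a stand-in for df$delta.
library(ggplot2)

set.seed(123)
delta <- rnorm(10000, mean = 2, sd = 1.5)

p <- ggplot(data.frame(delta = delta), aes(x = delta)) +
  geom_histogram(bins = 60) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "Unit-level treatment effect (delta)", y = "Count")
p
```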
Decision Thresholds for Policy Evaluation
Before we move on, it’s worth pausing for a minute to consider the evidence thresholds or decision rules that might be relevant to a policymaker. Doing so can help highlight the limitations of standard approaches to policy evaluation, and can also help elucidate the additional dimensions of exploration that partial identification approaches open up.
Suppose, for example, that a policymaker faces a decision over whether to expand or continue a treatment or program. What rules might she take into consideration when making this decision?
It may be, for example, that the policymaker simply needs the policy or program to be beneficial, on average, in the population. Or, the policymaker may additionally (or solely) want to ensure that the policy does little to no harm—that is, that less than some maximum threshold fraction (\(\lambda_1\)) of the population is made worse off under treatment. Alternatively, the policymaker may wish to ensure that the policy has a minimum treatment effect of at least \(\lambda_2\) for a majority of the population.
These are all reasonable and useful decision criteria. A key challenge is that even under the most ideal experimental designs, we can’t estimate many of these quantities of interest—at least, not without some additional (often heroic) assumptions.
Estimating The Fraction Harmed by Treatment
Before we get to the reasons why we often cannot estimate many of the above parameters of interest, let’s first calculate one of them in our “perfect” data: the fraction of the population harmed by treatment.
To estimate this fraction, we can plot the empirical cumulative distribution function (eCDF) of delta among the treated observations:
Recall that the CDF for some variable X is the function \(F_X(x)=\mathrm{P}(X \leq x)\)
Figure 2: Cumulative Distribution Function of the Treatment Effect (delta)
The red-shaded region in Figure 2 shows that 22 percent of treated samples had a negative (unit-level) treatment effect value for delta.
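The calculation behind the shaded region is simply the eCDF evaluated at zero. A minimal sketch, using the five treated units from Table 2 (with the full data, `df$delta` among `D == 1` would be used instead):

```r
# Fraction harmed = share of treated units with a negative unit-level effect.
# Toy data: the five treated units' delta values from Table 2.
delta_treated <- c(6, -1, 4, -1, 2)

F_delta <- ecdf(delta_treated)           # F(x) = P(delta <= x)
frac_harmed <- mean(delta_treated < 0)   # strict harm: delta < 0

frac_harmed  # 2 of 5 treated units harmed
F_delta(0)   # P(delta <= 0); equals frac_harmed here since no unit has delta == 0
```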
The Fundamental Problem of Causal Inference
We have seen above that with perfect data, we can estimate that, on average, the treatment is beneficial: it increases survival by 2 years. But we can also calculate that 22 percent of the population is harmed. Armed with this information, a policymaker could weigh the average benefit of the policy against the observation that some people are made worse off.
A major challenge arises, however, because we almost never observe the perfect data we were able to draw on above.
What is the underlying issue that precludes us from estimating the quantities of interest? The simple reason is that the running example above was based on a scenario in which we had full information on potential outcomes for each unit in our population. That is, we observed each unit’s outcome under treatment, and their outcome without treatment. Because of this, we can calculate a unit-level treatment effect estimate delta and plot the CDF of that (Figure 2). However, this is an idealized data scenario that is almost never the case.
As many introductory causal inference courses will teach you, what is more typical is to observe just one realization of the potential outcomes. Among treated units we observe the realized value of \(Y(1)\), and among untreated units we observe the realized value of \(Y(0)\). Each unit’s other (counterfactual) potential outcome is therefore missing from observation.
Table 3 re-casts the data above to show what we would observe in most real-world settings. That is, we only observe values in the column Y_1 for treated observations; and likewise, we only observe values in the column Y_0 for untreated observations. We then collect these observed values in a new column called Y_obs (Y-observed). And because we do not observe both potential outcomes for each unit, we cannot calculate unit-level values of delta at all.
Table 3: Perfect Doctor Data

| patient | D | Y_obs | Y_0 | Y_1 | delta |
|---------|---|-------|-----|-----|-------|
| A | 1 | 7 | ? | 7 | ? |
| B | 1 | 5 | ? | 5 | ? |
| C | 1 | 5 | ? | 5 | ? |
| D | 1 | 7 | ? | 7 | ? |
| E | 1 | 4 | ? | 4 | ? |
| F | 0 | 1 | 1 | ? | ? |
| G | 0 | 10 | 10 | ? | ? |
| H | 0 | 6 | 6 | ? | ? |
| I | 0 | 7 | 7 | ? | ? |
| J | 0 | 8 | 8 | ? | ? |
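The move from Table 2 to Table 3 can be expressed in code: starting from the full potential-outcomes data, mask whichever potential outcome was not realized. A sketch (column names follow the tables above):

```r
# Mask each unit's counterfactual to produce what an analyst would actually observe.
library(dplyr)

full <- tibble::tibble(
  patient = LETTERS[1:10],
  Y_0 = c(1, 6, 1, 8, 2, 1, 10, 6, 7, 8),
  Y_1 = c(7, 5, 5, 7, 4, 10, 1, 5, 3, 9),
  D   = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

observed <- full %>%
  mutate(Y_obs = ifelse(D == 1, Y_1, Y_0),  # realized outcome, computed first
         Y_1   = ifelse(D == 1, Y_1, NA),   # counterfactual missing for controls
         Y_0   = ifelse(D == 0, Y_0, NA),   # counterfactual missing for treated
         delta = NA_real_)                  # never directly computable
```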
Scale Dependence of Common Estimators
In addition to expanding the toolkit of policy- and decision-relevant parameters that can be estimated, analyzing policy impact using the CDF also helps highlight how we can test and address another issue: scale-dependence of common estimators. This is an area covered in detail in Roth and Sant’Anna (2023), and in my own recent work on difference-in-differences for categorical outcomes (Graves et al. 2022).
In short, for the parallel trends assumption to be robust to functional form, we need to identify the entire distribution of counterfactual outcomes for the treated group. Specifically, Roth and Sant’Anna (2023) show that the parallel trends assumption is invariant to transformations of the outcome if and only if a “parallel trends”-type assumption holds for the entire cumulative distribution function (CDF) of untreated potential outcomes.
There are three cases in which this condition holds:
If the distribution of potential outcomes is the same for both groups, as occurs under random assignment of treatment.
If the potential outcome distributions for each group are stable over time.
A hybrid of the first two cases in which the population is a mixture of a sub-population that is effectively randomized between treatment and control and another sub-population that has non-random treatment status but stable potential outcome distributions across time.
Roth and Sant’Anna (2023) thus demonstrate that parallel trends is either a functional form restriction or a combination of unconfoundedness and stationarity assumptions.
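To see the functional form issue concretely, consider a hypothetical two-group, two-period example (the numbers are made up for illustration): group means that evolve in parallel in levels need not evolve in parallel in logs, so the implied counterfactual trend depends on the scale the analyst chooses.

```r
# Hypothetical group-by-period means (illustrative numbers only).
control_pre  <- 2;  control_post <- 6    # change: +4
treated_pre  <- 10; treated_post <- 14   # change: +4 (parallel in levels)

# Parallel in levels: the gap in trends is zero.
trend_gap_levels <- (treated_post - treated_pre) - (control_post - control_pre)

# Not parallel in logs: the same data imply different trends on the log scale.
trend_gap_logs <- (log(treated_post) - log(treated_pre)) -
                  (log(control_post) - log(control_pre))

trend_gap_levels  # 0
trend_gap_logs    # log(14/10) - log(6/2), which is nonzero
```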
How Can We Recover Useful Treatment Effect Parameters?
In this example, we have simulated data under a hypothetical experimental design (i.e., treatment D is exogenously assigned and is therefore independent of the potential outcomes).
What can we point identify using the marginal distributions of the outcome (by treatment status) under this experimental design?
First, we can identify the Average Treatment Effect on the Treated (ATT):
The ATT is given by \(\text{ATT} = E[Y_1 - Y_0 \mid D=1]\)
Table 4: Mean Outcomes by Treatment Status, and Average Treatment Effect Estimate

| Untreated | Treated | hat_delta |
|-----------|---------|-----------|
| 5.584 | 7.619 | 2.035 |
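Under random assignment, the estimate in Table 4 is just a difference in means of the observed outcome. A minimal sketch using the ten example rows from Table 3 (the full simulated sample yields the hat_delta of roughly 2.0; in this tiny sample the estimate is very noisy):

```r
# ATT under random assignment: difference in mean observed outcomes.
Y_obs <- c(7, 5, 5, 7, 4, 1, 10, 6, 7, 8)  # from Table 3
D     <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)

att_hat <- mean(Y_obs[D == 1]) - mean(Y_obs[D == 0])
att_hat  # 5.6 - 6.4 = -0.8 in this ten-unit sample
```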
This is getting us somewhere. We learn that, on average, the treatment was beneficial in extending survival—but crucially, we can’t rule out that it harmed some units in the population.
We can also define the Quantile Treatment Effect on the Treated (QTT), \(QTT(\tau) = F_{Y_1 \mid D=1}^{-1}(\tau) - F_{Y_0 \mid D=1}^{-1}(\tau)\), which contrasts quantiles of the two marginal outcome distributions among the treated—but not quantiles of the treatment effect itself.
Is it possible to recover useful and actionable information on this question? Yes, it is! It turns out that while we can’t point identify some quantities of interest without strong assumptions, we can set identify them with minimal assumptions. In other words, without having to go to extraordinary lengths in terms of assumptions, we can put upper and lower bounds on the fraction of the population harmed by treatment, or on the fraction of the population with a treatment effect value greater than or less than some critical value \(\lambda\).
Treatment Effect Bounds
This blog series is constructed to provide information on the theory (mostly intuition) and the code needed to construct treatment effect bounds on both the distribution of the treatment effect and on the average treatment effect. Both approaches share a common DNA in that they start from a “worst-case” method that makes the absolute minimal assumptions. We can then tighten the estimated bounds, either by conditioning on (exogenous) covariates, or by making assumptions on the joint distribution of potential outcomes and/or the joint distribution of unit-level observations at different points in time.
We will get into the mechanics and theory in later sections, and will only run through a quick example of the utility of these methods below.
Bounds on the Distribution of the Treatment Effect
We’ll first construct worst-case bounds based on techniques covered in Abbring and Heckman (2007), Williamson and Downs (1990) and Frank, Nelsen, and Schweizer (1987).
Sharp bounds on the joint distribution of the potential outcomes with identified marginals are given by the Fréchet-Hoeffding lower and upper bound distributions; see Heckman and Smith (1993), Heckman, Smith, and Clements (1997), and Manski (1997b) for their applications in program evaluation.
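A sketch of how the Williamson and Downs (1990) bounds on the CDF of delta can be computed from the two marginal eCDFs alone. The function and variable names are my own, and the sup/inf are approximated on a finite grid; this is an illustrative implementation, not the exact code behind Figure 4.

```r
# Williamson-Downs (Makarov) bounds on F_delta(d) = P(Y_1 - Y_0 <= d),
# using only the marginal distributions of Y_1 and Y_0:
#   lower(d) = max( sup_y { F1(y) - F0(y - d) }, 0 )
#   upper(d) = 1 + min( inf_y { F1(y) - F0(y - d) }, 0 )
makarov_bounds <- function(y1, y0, delta_grid) {
  F1 <- ecdf(y1)
  F0 <- ecdf(y0)
  sapply(delta_grid, function(d) {
    y_grid <- sort(unique(c(y1, y0 + d)))  # candidate points for the sup/inf
    diffs  <- F1(y_grid) - F0(y_grid - d)
    c(lower = max(max(diffs), 0),
      upper = 1 + min(min(diffs), 0))
  })
}

# Example: with identical marginals the bounds at d = 0 are uninformative ([0, 1]) ...
b <- makarov_bounds(c(1, 2, 3), c(1, 2, 3), delta_grid = c(0, 10))
b[, 1]  # lower = 0, upper = 1
# ... but at d = 10, beyond any possible value of Y_1 - Y_0, both bounds equal 1.
b[, 2]  # lower = 1, upper = 1
```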
Figure 4: Bounds on the Cumulative Distribution Function of the Treatment Effect (delta)
Figure 11: Frandsen-Lefgren Treatment Effect Distribution Bounds With Covariates
Bounds on the Average Treatment Effect
Figure 12: Worst-Case Bounds on the Average Treatment Effect on the Treated, by Method
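As one concrete example of a worst-case approach, when the outcome is known to lie on a bounded support, the unobserved counterfactual mean can be replaced by the support’s endpoints. The sketch below is my own illustrative Manski-style bounding function, not the exact method behind Figure 12.

```r
# Worst-case (Manski-style) bounds on the ATT when Y is bounded in [y_min, y_max].
# E[Y_1 | D = 1] is observed; E[Y_0 | D = 1] is not, so bound it by the endpoints.
manski_att_bounds <- function(y_obs, d, y_min, y_max) {
  ey1_treated <- mean(y_obs[d == 1])
  c(lower = ey1_treated - y_max,
    upper = ey1_treated - y_min)
}

# Ten-unit example data (Table 3), with survival assumed bounded in [0, 10]:
Y_obs <- c(7, 5, 5, 7, 4, 1, 10, 6, 7, 8)
D     <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
manski_att_bounds(Y_obs, D, y_min = 0, y_max = 10)
# lower = 5.6 - 10 = -4.4; upper = 5.6 - 0 = 5.6
```

These bounds are wide by construction; the later posts in the series cover how covariates and cross-period assumptions tighten them.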
Decision Rules
Figure 13: Decision Thresholds Based on the Maximum Allowable Fraction of the Population Harmed by Treatment
Figure 14: Decision Thresholds Based on the Minimum Treatment Effect
References
Abbring, Jaap H., and James J. Heckman. 2007. “Econometric Evaluation of Social Programs, Part III: Distributional Treatment Effects, Dynamic Treatment Effects, Dynamic Discrete Choice, and General Equilibrium Policy Evaluation.” Handbook of Econometrics 6: 5145–5303.
Frandsen, Brigham R., and Lars J. Lefgren. 2021. “Partial Identification of the Distribution of Treatment Effects with an Application to the Knowledge Is Power Program (KIPP).” Quantitative Economics 12 (1): 143–71.
Frank, M. J., R. B. Nelsen, and B. Schweizer. 1987. “Best-Possible Bounds for the Distribution of a Sum: A Problem of Kolmogorov.” Probability Theory and Related Fields 74 (2): 199–211. https://doi.org/10.1007/BF00569989.
Graves, John A., Carrie Fry, J. Michael McWilliams, and Laura A. Hatfield. 2022. “Difference-in-Differences for Categorical Outcomes.” Health Services Research 57 (3): 681–92.
Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.
Roth, Jonathan, and Pedro H. C. Sant’Anna. 2023. “When Is Parallel Trends Sensitive to Functional Form?” Econometrica 91 (2): 737–47.
Rubin, Donald B. 2004. “Teaching Statistical Inference for Causal Effects in Experiments and Observational Studies.” Journal of Educational and Behavioral Statistics 29 (3): 343–67.
Williamson, Robert C., and Tom Downs. 1990. “Probabilistic Arithmetic. I. Numerical Methods for Calculating Convolutions and Dependency Bounds.” International Journal of Approximate Reasoning 4 (2): 89–158.